REMARKS



A substitute specification and a proper abstract of the disclosure are 
provided herewith which make editorial changes in order to conform to standard 
US practice. A marked-up copy of the specification is also provided reflecting the 
changes made. 

In addition, the claims as filed have been canceled and replaced by new 
claims that more clearly set forth the subject matter of Applicants' invention.

No new matter has been inserted into the application. 

Applicants submit that this application is in proper condition for 
examination in the United States National Stage, which action is earnestly 
solicited.

{Method and arrangement} [METHOD AND ARRANGEMENT FOR DETERMINING
A SEQUENCE OF ACTIONS FOR A SYSTEM

BACKGROUND OF THE INVENTION

Field of the Invention:

This invention generally pertains to systems having states, and in particular to
methods] for determining a sequence of actions for {a system which has states,} [such systems.

Discussion of the Related Art:

A generalized method and arrangement for determining a sequence of actions for a
system having states, wherein] a transition in state between two states {being} [is] performed on
the basis of an action[, is discussed by Neuneier in "Enhancing Q-Learning for Optimal Asset
Allocation", appearing in the Proceedings of the Neural Information Processing Systems, NIPS
1997. Neuneier describes a financial market as an example of] {The invention relates to a method
and an arrangement for determining a sequence of actions for} a system which has states[. His]{, a
transition in state between two states being performed on the basis of an action.

Such a method and such an arrangement are known from [1].

A financial market is described in [1] as an example for such a system which has states.

The} system is described as a Markov [Decision Problem (MDP).] {decision problem (MDP).
The structure of a system which can be described as a Markov decision problem is illustrated in
Figure 2.

The system 201 is in a state $x_t$ at an instant t. The state $x_t$ can be observed by an observer of
the system. On the basis of an action $a_t$ from a set in the state $x_t$ of possible actions, $a_t \in A(x_t)$, the
system makes a transition with a certain probability into a subsequent state $x_{t+1}$ at a subsequent
instant t+1.




{Description} [Substitute specification:]

This is illustrated diagrammatically in Figure 2 by a loop. An observer 200 perceives 202
observable variables concerning the state $x_t$ and takes a decision via an action 203 with which it acts
on the system 201. The system 201 is usually subject to the interference 205.

Furthermore, the observer 200 obtains a gain $r_t$ 204

which is a function of the action $a_t$ 203 and the original state $x_t$ at the instant t as well as of
the subsequent state $x_{t+1}$ of the system at the subsequent instant t+1.

The gain $r_t$ can assume a positive or negative scalar value, depending on whether the
decision leads, with regard to a prescribable criterion, to a positive or negative system development,
in [1] to an increase in capital stock or to a loss.

In a further time step, the observer 200 of the system 201 decides on the basis of the
observable variables 202, 204 of the subsequent state $x_{t+1}$ in favor of a new action $a_{t+1}$, etc.

A sequence of

State: $x_t \in X$
Action: $a_t \in A(x_t)$
Subsequent state: $x_{t+1} \in X$
Gain: $r_t = r(x_t, a_t, x_{t+1}) \in \mathbb{R}$

etc. describes a trajectory of the system which is evaluated by a performance criterion which
accumulates the individual gains $r_t$ over the instants t. It is assumed by way of simplification in a
Markov decision problem that the state $x_t$ and the action $a_t$ all contain information for the purpose of
describing a transition probability $p(x_{t+1} \mid \cdot)$ of the system from the state $x_t$ to the subsequent state
$x_{t+1}$.

In formal terms, this means that:

$p(x_{t+1} \mid x_t, a_t)$ denotes a transition probability for the subsequent state $x_{t+1}$ for a given state $x_t$
and given action $a_t$.

In a Markov decision problem, future states of the system 201 are thus not a function of
states and actions which lie further in the past than one time step.}

The characteristics of a Markov {decision problem} [Decision Problem] are represented
below by way of summary:

$X$: set of possible states of the system, e.g. $X =$

$A(x_t)$: set of possible actions in the state $x_t$,

$p(x_{t+1} \mid x_t, a_t)$: transition probability for the subsequent state $x_{t+1}$ for a given state $x_t$ and a given action $a_t$,

$r(x_t, a_t, x_{t+1})$: gain with expectation $R(x_t, a_t)$.

Starting from observable variables, the variables denoted below as training data, the aim is to
determine a strategy, that is to say a sequence of functions

$$\pi = \{\mu_0, \mu_1, \ldots, \mu_T\}, \qquad (3)$$

which at each instant t map each state into an action rule, that is to say action

$$\mu_t(x_t) = a_t. \qquad (4)$$

Such a strategy is evaluated by an optimization function. 

The optimization function specifies the expectation of the gains accumulated
over time for a given strategy and a start state $x_0$.

The so-called Q-learning method is described {in [1]} [by Neuneier] as an example of a 
method of approximative dynamic programming. 

An optimum evaluation function $V^*(x)$ is defined by

$$V^*(x) = \max_{\pi} V^{\pi}(x) \quad \forall x \in X \qquad (5)$$

where

$$V^{\pi}(x) = E\left[\sum_{t=0}^{\infty} \gamma^t \, r(x_t, \mu_t, x_{t+1}) \,\middle|\, x_0 = x\right], \qquad (6)$$

$\gamma$ denoting a prescribable reduction factor which is formed in accordance with the following rule:

$$\gamma = \frac{1}{1 + z}. \qquad (7)$$



(8) 



A Q-evaluation function $Q^*(x_t, a_t)$ is formed within the Q-learning method for each pair (state $x_t$,
action $a_t$) in accordance with the following rule:

$$Q^*(x_t, a_t) := \sum_{x_{t+1} \in X} p(x_{t+1} \mid x_t, a_t) \cdot r_t + \gamma \cdot \sum_{x \in X} p(x \mid x_t, a_t) \cdot \max_{a \in A} Q^*(x, a) \qquad (9)$$



On the basis respectively of the tuple $(x_t, x_{t+1}, a_t, r_t)$, the Q-values $Q^*(x, a)$ are
adapted in the (k+1)th iteration with a prescribed learning rate $\eta_k$ in accordance with the
following learning rule:

$$Q_{k+1}(x_t, a_t) = (1 - \eta_k) \, Q_k(x_t, a_t) + \eta_k \left( r_t + \gamma \max_{a \in A} Q_k(x_{t+1}, a) \right) \qquad (10)$$
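As a minimal sketch of learning rule (10), assuming a tabular representation of the Q-values, one iteration could look as follows. The dictionary layout, the function name `q_update` and the example states and actions are illustrative assumptions and are not taken from the specification.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, actions, eta, gamma):
    """Apply one step of rule (10) to the table q and return the new Q-value."""
    # Best Q-value reachable from the subsequent state under the current table.
    best_next = max(q[(next_state, a)] for a in actions)
    # Blend the old estimate with the new target using the learning rate eta.
    q[(state, action)] = (1.0 - eta) * q[(state, action)] + eta * (reward + gamma * best_next)
    return q[(state, action)]


# Hypothetical two-action example; unseen pairs default to a Q-value of 0.
q_table = defaultdict(float)
actions = ["buy", "sell"]
first = q_update(q_table, "s0", "buy", 1.0, "s1", actions, eta=0.5, gamma=0.9)
q_table[("s1", "sell")] = 2.0
second = q_update(q_table, "s0", "buy", 1.0, "s1", actions, eta=0.5, gamma=0.9)
```

With a learning rate of 0.5, each call moves the stored value halfway toward the target formed from the gain and the discounted best subsequent Q-value.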



Usually, the so-called Q-values $Q^*(x, a)$ are approximated for various actions {a} by a function
approximator in each case, for example a neural network or {else} a polynomial classifier, with a
weighting vector $w^a$ which contains weights of the function approximator.

A function approximator is {to be understood as}, for example, a neural network, a polynomial
classifier or {else} a combination of a neural network with a polynomial classifier.

It therefore holds that:

$$Q^*(x, a) \approx Q(x; w^a) \qquad (11)$$

Changes in the weights in the weighting vector $w^a$ are based on a temporal difference $d_t$ which
is formed in accordance with the following rule:

$$d_t := r(x_t, a_t, x_{t+1}) + \gamma \max_{a \in A} Q(x_{t+1}; w_k^a) - Q(x_t; w_k^{a_t}) \qquad (12)$$

The following adaptation rule for the weights of the neural network, which are included in the
weighting vector $w^a$, follows for the Q-learning method with the use of a neural network:

$$w_{k+1}^{a_t} = w_k^{a_t} + \eta_k \cdot d_t \cdot \nabla Q(x_t; w_k^{a_t}). \qquad (13)$$

The neural network representing the system of a financial market, as described {in [1],} [by
Neuneier,] is trained using the training data which describe information on changes in prices on a
financial market as time series values.

A further method of approximative dynamic programming {, the so-called TD(λ)-learning method,
is known from [2] and is explained in more detail in conjunction with an exemplary
embodiment.} [is the so-called TD(λ)-learning
method. This method is discussed in R.S. Sutton's "Learning To Predict By
The Method Of Temporal Differences", appearing in Machine Learning, Vol. 3,
pages 9-44, 1988.]

Furthermore, it is known from {[3] which} [M. Heger's "Risk and Reinforcement Learning:
Concepts and Dynamic Programming", ZKW Bericht No. 8/94, Zentrum für
Kognitionswissenschaften [Center for Cognitive Sciences], Bremen University,
December 1994, that] risk is associated with a strategy $\pi$ and an initial
state $x_t$. A method for risk {avoidance is likewise known from [3]} [avoidance is also discussed by
Heger, cited above].

The following optimization function, which is also referred to as an expanded Q-function
$Q^{\pi}(x_t, a_t)$, is used in the [Heger] method {known from [3]}:

$$Q^{\pi}(x_t, a_t) := \inf_{\substack{x_0, x_1, \ldots \\ p(x_0, x_1, \ldots) > 0}} \left[ r(x_t, a_t, x_{t+1}) + \sum_{k=1}^{\infty} \gamma^k \, r(x_k, \pi(x_k), x_{k+1}) \right] \to \text{maximize} \qquad (14)$$

The expanded Q-function $Q^{\pi}(x_t, a_t)$ describes the worst case if the action $a_t$ is executed
in the state $x_t$ and the strategy $\pi$ is followed thereupon.

The optimization function $Q^*(x_t, a_t)$ for

$$Q^*(x_t, a_t) := \max_{\pi \in \Pi} Q^{\pi}(x_t, a_t) \qquad (15)$$

is given by the following rule:

$$Q^*(x_t, a_t) = \min_{\substack{x \in X \\ p(x \mid x_t, a_t) > 0}} \left[ r(x_t, a_t, x) + \gamma \cdot \max_{a \in A} Q^*(x, a) \right] \qquad (16)$$

A substantial disadvantage of this mode of procedure is {to be seen in} that only the worst case
is taken into account when finding the strategy. However, this [inadequately] reflects the
requirements of the most varied technical systems {only to an inadequate extent}.

{Furthermore, it is known from [4] to formulate} [In "Dynamic Programming and Optimal
Control", Athena Scientific, Belmont,
MA, 1995, D.P. Bertsekas formulates] access control for a communications network
and {the} routing within the communications network as a problem of dynamic
programming.

{The} [Therefore, the present] invention is {therefore} based on the problem of specifying a
method and {an arrangement} [system] for determining a sequence of actions {for a system,} in
which [the] method or {action} [sequences of actions achieve] an increased flexibility in
determining the strategy {is achieved.} [needed.]

{The problem is solved by the method and by the arrangement in accordance with the features
of the independent patent claims.

In a method for computer-aided determination of a sequence of actions for a system which has
states, a transition in state between two states being performed on the basis of an action, the
determination of the sequence of actions is performed in such a way that a sequence of states
resulting from the sequence of actions is optimized with regard to a prescribed optimization function,
the optimization function including a variable parameter with the aid of which it is possible to set a
risk which the resulting sequence of states has with respect to a prescribed state of the system.

An arrangement for determining} [In a method for computer-aided determination of] a sequence of
actions for a system which has states, a transition in state between two states being performed on the
basis of an action, {has a processor which is set up in such a way that} the determination of the
sequence of actions {can be} [is] performed in such a way that a sequence of states resulting from
the sequence of actions is optimized with regard to a prescribed optimization function, the
optimization function including a variable parameter with the aid of which it is possible to set a risk
which the resulting sequence of states has with respect to a prescribed state of the system.

[A system for determining a sequence of actions for a system which has states, a
transition in state between two states being performed on the basis of an action, has a
processor which is set up in such a way that the determination of the
sequence of actions can be performed in such a way that a sequence of states
resulting from the sequence of actions is optimized with regard to a prescribed optimization
function, the optimization function including a variable parameter with the aid of which it is
possible to set a risk which the resulting sequence of states has
with respect to a prescribed state of the system.

Thus, the present invention offers] {It becomes possible for the first time owing to the
invention to specify} a method for determining a sequence of actions at a freely prescribable level of
accuracy when finding a strategy for a possible closed-loop control or open-loop control of the
system, in general for influencing it. [Hence, the embodiments] {Preferred developments of the
invention follow from the dependent claims.

The developments} described below are valid both for the method and for the {arrangement, the
processor being respectively set up in the development of arrangement in such a way that the
development can be implemented.} [system.]

{In a preferred refinement, a method of approximative} [Approximative] dynamic
programming is used for the purpose of determination, for example a method
based on Q-learning or {else} a method based on TD(λ)-learning.

Within Q-learning, the optimization function $OF_Q$ is preferably formed in accordance with the
following rule:

$$OF_Q = Q(x; w^a)$$

• $x$ denoting a state in a state space $X$,

• $a$ denoting an action from an action space $A$, and

• $w^a$ denoting the weights of a function approximator which belong to the action $a$.

The following adaptation step is executed during Q-learning in order to determine the
optimum weights $w^a$ of the function approximator:

$$w_{t+1}^{a_t} = w_t^{a_t} + \eta_t \cdot \chi^{\kappa}(d_t) \cdot \nabla Q(x_t; w_t^{a_t})$$

with the abbreviation

$$d_t = r(x_t, a_t, x_{t+1}) + \gamma \max_{a \in A} Q(x_{t+1}; w_t^a) - Q(x_t; w_t^{a_t}),$$

• $x_t$, $x_{t+1}$ respectively denoting a state in the state space $X$,

• $a_t$ denoting an action from an action space $A$,

• $\gamma$ denoting a prescribable reduction factor,

• $w_t^{a_t}$ denoting the weighting vector associated with the action $a_t$ before the adaptation
step,

• $w_{t+1}^{a_t}$ denoting the weighting vector associated with the action $a_t$ after the adaptation
step,

• $\eta_t$ (t = 1, ...) denoting a prescribable step size sequence,

• $\kappa \in [-1; 1]$ denoting a risk monitoring parameter,

• $\chi^{\kappa}$ denoting a risk monitoring function $\chi^{\kappa}(\xi) = (1 - \kappa \, \mathrm{sign}(\xi)) \, \xi$,

• $\nabla Q(\cdot\,; \cdot)$ denoting the derivative of the function approximator with respect to its weights, and

• $r(x_t, a_t, x_{t+1})$ denoting a gain upon the transition of state from the state $x_t$ to the subsequent
state $x_{t+1}$.
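The adaptation step for Q-learning with a risk monitoring function can be sketched as follows. This is only an illustration under stated assumptions: the function approximator is taken to be linear in a hypothetical feature vector (so its gradient with respect to the weights is the feature vector itself), and all names (`risk_transform`, `risk_sensitive_q_step`, the action labels) are invented for the example.

```python
def risk_transform(xi, kappa):
    """Risk monitoring function: (1 - kappa * sign(xi)) * xi."""
    sign = (xi > 0) - (xi < 0)
    return (1.0 - kappa * sign) * xi


def q_value(features, weights):
    """Assumed linear approximator Q(x; w) = w . phi(x)."""
    return sum(f * w for f, w in zip(features, weights))


def risk_sensitive_q_step(weights_by_action, feats, action, reward,
                          next_feats, eta, gamma, kappa):
    """Adapt the weights of the executed action and return the temporal difference."""
    best_next = max(q_value(next_feats, w) for w in weights_by_action.values())
    d = reward + gamma * best_next - q_value(feats, weights_by_action[action])
    scale = eta * risk_transform(d, kappa)
    # For a linear approximator the gradient with respect to the weights is feats.
    weights_by_action[action] = [
        w + scale * f for w, f in zip(weights_by_action[action], feats)
    ]
    return d


weights = {"accept": [0.0, 0.0], "reject": [0.0, 0.0]}
d_first = risk_sensitive_q_step(weights, [1.0, 0.0], "accept", 1.0,
                                [0.0, 1.0], eta=0.1, gamma=0.9, kappa=0.5)
```

With a positive risk parameter, a positive temporal difference is damped and a negative one is amplified, so unfavourable surprises influence the weights more strongly than favourable ones.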

The optimization function is preferably formed in accordance with the following
rule within the TD(λ)-learning method:

$$OF_{TD} = J(x; w)$$

• $x$ denoting a state in a state space $X$,

• $a$ denoting an action from an action space $A$, and

• $w$ denoting the weights of a function approximator.

The following adaptation step is executed during TD(λ)-learning in order to determine the
optimum weights $w$ of the function approximator:

$$w_{t+1} = w_t + \eta_t \cdot \chi^{\kappa}(d_t) \cdot z_t$$

with the abbreviations

$$d_t = r(x_t, a_t, x_{t+1}) + \gamma J(x_{t+1}; w_t) - J(x_t; w_t),$$

$$z_t = \lambda \cdot \gamma \cdot z_{t-1} + \nabla J(x_t; w_t),$$

• $x_t$, $x_{t+1}$ respectively denoting a state in the state space $X$,

• $a_t$ denoting an action from an action space $A$,

• $\gamma$ denoting a prescribable reduction factor,

• $w_t$ denoting the weighting vector before the adaptation step,

• $w_{t+1}$ denoting the weighting vector after the adaptation step,

• $\eta_t$ (t = 1, ...) denoting a prescribable step size sequence,

• $\kappa \in [-1; 1]$ denoting a risk monitoring parameter,

• $\chi^{\kappa}$ denoting a risk monitoring function $\chi^{\kappa}(\xi) = (1 - \kappa \, \mathrm{sign}(\xi)) \, \xi$,

• $\nabla J(\cdot\,; \cdot)$ denoting the derivative of the function approximator with respect to its
weights, and

• $r(x_t, a_t, x_{t+1})$ denoting a gain upon the transition of state from the state $x_t$ to the subsequent
state $x_{t+1}$.
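A single TD(λ) adaptation step with the risk monitoring function might be sketched as below. The sketch assumes a linear function approximator over a hypothetical feature vector and an eligibility vector initialised to zero; neither assumption is stated in the specification, and the function names are illustrative.

```python
def risk_transform(xi, kappa):
    """Risk monitoring function: (1 - kappa * sign(xi)) * xi."""
    sign = (xi > 0) - (xi < 0)
    return (1.0 - kappa * sign) * xi


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def td_lambda_step(w, z, feats, reward, next_feats, eta, gamma, lam, kappa):
    """Return the adapted weights, the new eligibility vector and d_t."""
    # Temporal difference for the assumed linear approximator J(x; w) = w . phi(x).
    d = reward + gamma * dot(next_feats, w) - dot(feats, w)
    # Decay the previous eligibility vector and add the current gradient (= feats).
    z_new = [lam * gamma * zi + fi for zi, fi in zip(z, feats)]
    scale = eta * risk_transform(d, kappa)
    w_new = [wi + scale * zi for wi, zi in zip(w, z_new)]
    return w_new, z_new, d


w0, z0 = [0.0, 0.0], [0.0, 0.0]
w1, z1, d0 = td_lambda_step(w0, z0, [1.0, 0.0], -1.0, [0.0, 1.0],
                            eta=0.1, gamma=0.9, lam=0.5, kappa=0.5)
```

In this example the gain is negative, so the risk-avoiding setting enlarges the correction by a factor of 1.5 compared with the risk-neutral case.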



[SUMMARY OF THE INVENTION

It is an object of the present invention to provide a technical system and method for
determining a] {The system is preferably a technical system of which before the determination
measured values are measured which are used in determining the} sequence of actions [using
measured values.

It is another object of the present invention to provide a technical system and method
that] {The technical system} can be subjected to open-loop control or {else} closed-loop control with the
use of {the} [a] determined sequence of actions.

{The system is preferably} [It is a further object of the invention to provide a technical
system and method] modeled as a Markov {decision problem.} [Decision Problem.]

{The method or the arrangement is preferably} [It is an additional object of the invention to
provide a technical system and method that can be] used in a traffic management system {or} [.

It is yet another object of the invention to provide a technical system and method that
can be used] in a communications system, {the} [such that a] sequence of actions {being used in a
communications network} [is used] to carry out access control {or a routing, that is to say a path
allocation.} [, routing or path allocation.]

{Furthermore, the system can be a financial market which is modeled by a Markov decision
problem, the change in the financial market, for example the} [It is yet a further object of the
invention to provide a technical system and
method for a financial market modeled by a Markov Decision Problem, wherein a] change in an
index of stocks[,] or {else} [a change in] a rate of exchange on a foreign exchange market {being
analyzed by using the method and/or the arrangement and it being} [, makes it] possible to intervene
in the market in accordance with {the} [a] sequence of determined actions.

[These and other objects of the invention will be apparent from a careful review
of the following detailed description of the preferred embodiments, which is to be read in
conjunction with a review of the accompanying drawing figures.

BRIEF DESCRIPTION OF THE DRAWINGS] {Exemplary embodiments of the invention are
illustrated in the figures and explained in more detail below.}



Figure 1 shows a flowchart {in which individual} [of] method steps {of the first exemplary
embodiment are illustrated} [according to the present invention];
Figure 2 shows a {sketch of a} system {which can be} modeled as a Markov {decision
problem} [Decision Problem];
Figure 3 shows a {sketch of a} communications network {in which} [wherein] access control is
carried out in a switching unit [according to the present invention];
Figure 4 shows a {symbolic sketch of a} function approximator {with the aid of which a method
of} [for] approximative dynamic programming {is implemented} [according to the
present invention];
Figure 5 shows {a further sketch of} a plurality of function approximators {with the aid of which}
[for] approximative dynamic programming {is implemented} [according to the
present invention]; and
Figure 6 shows a {sketch of a} traffic management system {which is} subjected to closed-loop
control in accordance with {an exemplary embodiment.} [the present invention.]

{First exemplary embodiment: access control and routing} [DETAILED DESCRIPTION OF THE
PREFERRED EMBODIMENTS]



{Figure 3 shows} [Figure 1 shows a flowchart according to the present invention, in which
individual method steps of a first embodiment are provided, which will be discussed
later.

Figure 2 shows the structure of a system that can be described as a Markov Decision Problem.

The system 201 is in a state $x_t$ at an instant t. The state $x_t$ can be observed by
an observer of the system. On the basis of an action $a_t$ from a set in the state $x_t$ of
possible actions, $a_t \in A(x_t)$, the system makes a transition with a certain probability
into a subsequent state $x_{t+1}$ at a subsequent instant t+1.

As illustrated diagrammatically in Figure 2 by a loop, an observer 200 perceives 202
observable variables concerning the state $x_t$ and takes a decision via an action 203 with which
it acts on the system 201. The system 201 is usually subject to the interference 205.

The observer 200 obtains a gain $r_t$ 204

$$r_t = r(x_t, a_t, x_{t+1}) \in \mathbb{R}, \qquad (1)$$

which is a function of the action $a_t$ 203 and the original state $x_t$ at the instant t as well as of the
subsequent state $x_{t+1}$ of the system at the subsequent instant t+1.

The gain $r_t$ can assume a positive or negative scalar value depending on whether the
decision leads, with regard to a prescribable criterion, to a positive or negative system
development, for example to an increase in capital stock or to a loss.

In a further time step, the observer 200 of the system 201 decides on the basis of the
observable variables 202, 204 of the subsequent state $x_{t+1}$ in favor of a new action $a_{t+1}$, etc.

A sequence of

State: $x_t \in X$

Action: $a_t \in A(x_t)$

Subsequent state: $x_{t+1} \in X$

Gain: $r_t = r(x_t, a_t, x_{t+1}) \in \mathbb{R}$

describes a trajectory of the system which is evaluated by a performance criterion which
accumulates the individual gains $r_t$ over the instants t. It is assumed by way of simplification in
a Markov Decision Problem that the state $x_t$ and the action $a_t$ contain all the information needed for the
purpose of describing a transition probability $p(x_{t+1} \mid \cdot)$ of the system from the state $x_t$ to the
subsequent state $x_{t+1}$.

In formal terms, this means that:

$$p(x_{t+1} \mid x_t, \ldots, x_0, a_t, \ldots, a_0) = p(x_{t+1} \mid x_t, a_t). \qquad (2)$$

$p(x_{t+1} \mid x_t, a_t)$ denotes a transition probability for the subsequent state $x_{t+1}$ for a given state $x_t$ and
given action $a_t$.

In a Markov Decision Problem, future states of the system 201 are thus not a function
of states and actions which lie further in the past than one time step.
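The loop of state, action, subsequent state and gain described above can be illustrated with a small simulation. The transition table, the state and action names and the numerical gains below are purely hypothetical; the sketch only shows that the subsequent state is drawn from a distribution that depends on the current state and action alone.

```python
import random

# Hypothetical model: (state, action) -> list of (next_state, probability, gain).
TRANSITIONS = {
    ("low", "invest"): [("high", 0.7, 1.0), ("low", 0.3, -0.5)],
    ("low", "wait"): [("low", 1.0, 0.0)],
    ("high", "invest"): [("high", 0.6, 0.5), ("low", 0.4, -1.0)],
    ("high", "wait"): [("high", 1.0, 0.0)],
}


def step(state, action, rng):
    """Sample the subsequent state and gain from p(. | state, action)."""
    outcomes = TRANSITIONS[(state, action)]
    probabilities = [p for _, p, _ in outcomes]
    next_state, _, gain = rng.choices(outcomes, weights=probabilities, k=1)[0]
    return next_state, gain


rng = random.Random(0)
state = "low"
trajectory = []
for action in ["invest", "wait", "invest"]:
    next_state, gain = step(state, action, rng)
    trajectory.append((state, action, next_state, gain))
    state = next_state
```

Because `step` receives only the current state and action, the history of earlier states and actions cannot influence the sampled transition, which is the Markov property expressed in rule (2).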

Figure 3 shows an embodiment of the present invention involving an access control and
routing system, such as] a communications network 300{, which} [.

The communications network 300] has a multiplicity of switching units 301a, 301b, 301i,
... 301n, which are interconnected via connections 302a, 302b, 302j, ... 302m. {Furthermore, a} [A]
first terminal 303 is connected to a first switching unit 301a. From the first terminal 303, the first
switching unit 301a is sent a request message 304 which requests reservation of a prescribed
bandwidth within the communications network 300 for the purpose of transmitting data[, such as]
video data [or] text data.

It is determined in the first switching unit 301a, in accordance with a strategy described below[,]
whether the requested bandwidth is available in the communications network
300 on a specified, requested connection {(step 305)} [in step 305].
The request is refused {(step 306)} [in step 306] if this is not the case.
If sufficient bandwidth is available, it is checked in {a further} checking step {(step 307)} [307]
whether the bandwidth can be reserved.

The request is refused {(step 308)} [in step 308] if this is not the case.
Otherwise, the first switching unit 301a selects a route from the first switching unit 301a via further
switching units 301i to a second terminal 309 with which the first terminal 303 wishes to
communicate, and a connection is initialized {(step 310)} [in step 310].
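The sequence of checks in steps 305 to 310 can be summarised in a short sketch. The function name, the argument names and the rule of simply taking the first candidate route are placeholders for illustration; the actual route selection follows the learned strategy described later in the specification.

```python
def handle_request(requested_bw, available_bw, reservation_ok, candidate_routes):
    """Walk through the checks of steps 305 to 310 for one request message."""
    # Step 305: is the requested bandwidth available on the requested connection?
    if requested_bw > available_bw:
        return {"accepted": False, "step": 306}
    # Step 307: can the bandwidth actually be reserved?
    if not reservation_ok:
        return {"accepted": False, "step": 308}
    # Step 310: select a route and initialize the connection (placeholder choice).
    return {"accepted": True, "step": 310, "route": candidate_routes[0]}


too_large = handle_request(8, 5, True, [["301a", "301i"]])
not_reservable = handle_request(4, 5, False, [["301a", "301i"]])
connected = handle_request(4, 5, True, [["301a", "301i"]])
```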

The starting point below is a communications network 300 which comprises a set of switching
units

$$N = \{1, \ldots, n, \ldots, N\} \qquad (17)$$

and a set of physical connections

$$L = \{1, \ldots, l, \ldots, L\}, \qquad (18)$$

a physical connection $l$ having a capacity of $B(l)$ bandwidth units.

A set

$$M = \{1, \ldots, m, \ldots, M\} \qquad (19)$$

of different types of service m is available, a type of service m being characterized by

• a bandwidth requirement $b(m)$,

• an average connection time $\frac{1}{\nu(m)}$, and

• a gain $c(m)$ which is obtained whenever a call request of the corresponding type of service
m is accepted.

The gain $c(m)$ is given by the amount of money which a network operator of the
communications network 300 bills a subscriber for a connection of the type of service. Clearly, the
gain $c(m)$ reflects different priorities, which can be prescribed by the network operator and which the
operator associates with different services.

A physical connection $l$ can simultaneously provide any desired combination of
communications connections as long as the bandwidth used for the communications connections
does not exceed the bandwidth available overall for the physical connection.

If a new communications connection of type m is requested between a first node i and a second
node j (terminals are also denoted as nodes), the requested communications connection can, as
represented above, either be accepted or be refused.
If the communications connection is accepted, a route is selected from a set of prescribed routes.
This selection is denoted as routing. $b(m)$ bandwidth units are used in the communications
connection of type m for each physical connection along
the selected route for the duration of the connection.
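The bandwidth bookkeeping described above can be sketched as follows, assuming that the capacities B(l) and the currently used bandwidth per physical connection are held in dictionaries keyed by a connection identifier. The identifiers and numbers are invented for the example.

```python
def route_has_capacity(route, used, capacity, demand):
    """True if every physical connection on the route can carry demand more units."""
    return all(used[link] + demand <= capacity[link] for link in route)


def reserve(route, used, demand):
    """Book demand bandwidth units on every physical connection of the route."""
    for link in route:
        used[link] += demand


capacity = {"l1": 10, "l2": 4, "l3": 10}   # B(l) for three hypothetical connections
used = {"l1": 2, "l2": 3, "l3": 0}         # bandwidth units currently in use
demand = 2                                  # b(m) for the requested type of service

via_l2 = route_has_capacity(["l1", "l2"], used, capacity, demand)
via_l3 = route_has_capacity(["l1", "l3"], used, capacity, demand)
if via_l3:
    reserve(["l1", "l3"], used, demand)
```

The route over the second connection is rejected because 3 + 2 units would exceed its capacity of 4, whereas the alternative route is feasible and is then charged with the demand on each of its connections.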

Thus, during access control[, also referred to as] call admission control, a route can be
selected within the communications network 300 only when the selected route has sufficient
bandwidth available.
The aim of the access control and of the routing is to maximize a long term gain which is obtained by
acceptance of the requested connections.

At an instant t, the technical system which is the communications network 300 is in a state $x_t$
which is described by a list of routes via existing connections; this list shows how
many connections of which type of service are using the respective routes at the instant t.

Events $\omega$, by means of which a state $x_t$ could be transferred into a subsequent state $x_{t+1}$, are the
arrival of new connection request messages, or else the termination of a connection
existing in the communications network 300.

In this {exemplary} embodiment, an action $a_t$ at an instant t[,] owing to a connection request is
the decision as to whether a connection request is to be accepted or refused and, if the connection is
accepted, the selection of the route through the communications network 300.

The aim is to determine a sequence of actions, that is to say to learn
a strategy with actions relating to a state $x_t$, in such a way that the following rule is maximized:



$$E\left\{ \sum_{k=0}^{\infty} e^{-\beta t_k} \cdot g(x_{t_k}, \omega_k, a_{t_k}) \right\}, \qquad (20)$$

• $E\{\cdot\}$ denoting an expectation,

• $t_k$ denoting an instant at which a kth event takes place,

• $g(x_{t_k}, \omega_k, a_{t_k})$ denoting the gain which is associated with the kth event, and

• $\beta$ denoting a reduction factor which evaluates an immediate gain as being more valuable
than a gain at instants lying further in the future.

Different implementations of a strategy normally lead to different overall gains G:

$$G = \sum_{k=0}^{\infty} e^{-\beta t_k} \cdot g(x_{t_k}, \omega_k, a_{t_k}). \qquad (21)$$

The aim is to maximize the expectation of the overall gain G in accordance with the following
rule J:

$$J = E\left\{ \sum_{k=0}^{\infty} e^{-\beta t_k} \cdot g(x_{t_k}, \omega_k, a_{t_k}) \right\}, \qquad (22)$$

it being possible to set a risk which reduces the overall gain G of a specific implementation of access
control and of a routing strategy to below the expectation.
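For a finite, recorded sequence of events, the overall gain G of rule (21) can be evaluated directly. The sketch below assumes that each event is stored as a pair of its instant and its gain; the event list and the value chosen for the reduction factor are illustrative only.

```python
import math


def overall_gain(events, beta):
    """Sum of exp(-beta * t_k) * g_k over recorded (t_k, g_k) pairs."""
    return sum(math.exp(-beta * t_k) * g_k for t_k, g_k in events)


# Two hypothetical events: a gain of 2.0 at instant 0 and a gain of 1.0 at instant 1.
events = [(0.0, 2.0), (1.0, 1.0)]
beta = math.log(2.0)  # chosen so that a gain one time unit later counts half
G = overall_gain(events, beta)
```

Averaging G over many recorded implementations of the same strategy would give an empirical estimate of the expectation J in rule (22).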



The TD(λ)-learning method is used to carry out the access control and the
routing.

The following target function is used in this {exemplary} embodiment:

$$J^*(x_t) = E_{\tau}\left\{ e^{-\beta \tau} \right\} E_{\omega}\left\{ \max_{a \in A} \left[ g(x_t, \omega_t, a) + J^*(x_{t+1}) \right] \right\} \qquad (23)$$

• $A$ denoting an action space with a prescribed number of actions which are respectively
available in a state $x_t$,

• $\tau$ denoting a first instant at which a first event $\omega$ occurs, and

• $x_{t+1}$ denoting a subsequent state of the system.

An approximated value of the target value $J^*(x_t)$ is learned and stored by employing a function
approximator 400 (compare Figure 4) with the use of training data.

Training data are data previously measured in the communications network 300 and relating to
the behavior of the communications network 300 in the case of incoming connection requests 304
and of termination of messages. This time sequence of states is stored, and these training data are
used to train the function approximator 400 in accordance with the learning method described below.

A number of connections of in each case one type of service m on a route of the
communications network 300 serve in each case as input variable of the function approximator 400
for each input 401, 402, 403 of the function approximator 400.
These are represented {symbolically} in Figure 4 by blocks 404, 405, 406.
An approximated target value $\tilde{J}$ of the target value $J^*$ is the output variable of the function
approximator 400.

Figure 5 shows a detailed representation of {the} [a] function approximator 500, which {in this
case} has several component function approximators 510, 520 {of the function approximator 500}.

One output variable is the approximated target value $\tilde{J}$, which is formed in accordance with
the following rule:

(24)

The input variables of the component function approximators 510, 520, which are present at the
inputs 511, 512, 513 of the first component function approximator 510, or at the inputs 521, 522 and
523 of the second component function approximator 520, are, in turn, respectively a number of
connections of a type of service m on a physical connection r in each case, symbolized by blocks 514, 515, 516 for
the first component function approximator 510, and 524, 525 and 526 for the second component function
approximator 520.

Component output variables 530, 531, 532, 533 are fed to an adder unit 540, and the
approximated target variable $\tilde{J}$ is formed as output variable of the adder unit.
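The arrangement of component function approximators feeding an adder unit can be sketched as follows. The sketch assumes, purely for illustration, that each component is a linear function of the connection counts on one physical connection; the specification does not fix the form of the components, and all names and numbers are hypothetical.

```python
def component_output(counts, weights):
    """Assumed linear component: weighted sum of the connection counts per service type."""
    return sum(c * w for c, w in zip(counts, weights))


def approximated_target(per_link_counts, per_link_weights):
    """Adder unit: sum of the outputs of all component function approximators."""
    return sum(
        component_output(counts, weights)
        for counts, weights in zip(per_link_counts, per_link_weights)
    )


# Two physical connections, three types of service each (hypothetical numbers).
counts = [[1, 2, 0], [0, 1, 1]]
weights = [[0.5, 1.0, 2.0], [1.0, 1.0, 3.0]]
j_tilde = approximated_target(counts, weights)
```

Splitting the approximator per physical connection keeps each component small, since it only sees the counts of its own connection, while the adder combines them into a single value for the network state.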

Let it be assumed that the communications network 300 is in the state x tk | and that a 

request message with which a type of service m of class m is requested for a connection { 
}between two nodes i, j reaches the first switching unit 301a. 

A list of permitted routes between the nodes i and j is denoted by R(i, j), and a list of all possible 
routes is denoted by 



as a subset of the routes R(i, j) which could implement a possible connection with regard to the 
available and requested bandwidth. 



R(i, j/ x tk ) c R(i, j) 



(25) 



For each possible route r, r ∈ R(i, j, x_{t_k}), a subsequent state x_{t_{k+1}}(x_{t_k}, ω_k, r) is determined which results from the fact that the connection request 304 is accepted and the connection on the route r is made available to the requesting first terminal 303.

This is illustrated in Figure 1 as step 102, the state of the system and the respective event being determined in step 101. A route r* to be selected is determined in step 103 in accordance with the following rule:

r^* = \arg\max_{r \in R(i, j, x_{t_k})} \tilde{J}(x_{t_{k+1}}(x_{t_k}, \omega_k, r), \Theta_t).    (26)

A check is made in step 104 as to whether the following rule is fulfilled:

c(m) + \tilde{J}(x_{t_{k+1}}(x_{t_k}, \omega_k, r^*), \Theta_t) < \tilde{J}(x_{t_k}, \Theta_t).    (27)

If this is the case, the connection request 304 is rejected in step 105, otherwise the connection is accepted and "switched through" to the node j along the selected route r* in step 106.
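Steps 103 to 106 can be sketched in Python as follows. This is only an illustration under stated assumptions: the state is modeled, hypothetically, as a tuple of free capacities per route, the approximated target value is passed in as a plain callable, and the names `decide_connection`, `occupy` and `toy_value` are invented for the example. Rejecting a request when no feasible route exists is likewise an assumption.

```python
from typing import Callable, Iterable, Optional, Tuple

State = Tuple[int, ...]  # hypothetical state: free capacity per route


def decide_connection(
    state: State,
    feasible_routes: Iterable[int],
    next_state: Callable[[State, int], State],
    value: Callable[[State], float],
    gain: float,
) -> Optional[int]:
    """Return the selected route if the request is accepted, or None if rejected."""
    routes = list(feasible_routes)
    if not routes:
        return None  # assumption: no feasible route means rejection
    # Step 103: choose the route whose subsequent state has the highest value.
    best_route = max(routes, key=lambda r: value(next_state(state, r)))
    # Step 104: reject if the gain plus the value after acceptance
    # falls below the value of the current state.
    if gain + value(next_state(state, best_route)) < value(state):
        return None  # step 105: reject
    return best_route  # step 106: accept and switch through


def occupy(state: State, route: int) -> State:
    """Toy transition: accepting a request uses one unit of capacity on the route."""
    return tuple(c - 1 if i == route else c for i, c in enumerate(state))


def toy_value(state: State) -> float:
    """Toy value function standing in for the trained function approximator."""
    return 2.0 * state[0] + 1.0 * state[1]
```

With these toy definitions, a request with a gain of 2.0 in state (3, 3) is accepted on route 1, while a request with a gain of 0.5 is rejected because keeping the capacity free is valued more highly.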

Weights of the function approximator 400, 500, which are adapted in the TD(λ)-learning method to the training data, are stored in a parameter vector Θ for an instant t in each case, such that an optimized access control and an optimized routing are achieved.



During the training phase, the weighting parameters are adapted to the training data applied to 
the function approximator. 



A risk parameter κ is defined with the aid of which a desired risk, which the system has with regard to a prescribed state owing to a sequence of actions and states, can be set in accordance with the following rules:

-1 ≤ κ < 0: risky learning,

κ = 0: neutral learning with regard to the risk,

0 < κ < 1: risk-avoiding learning,

κ = 1: worst-case learning.

Furthermore, a prescribable parameter 0 < λ < 1 and a step size sequence γ_k are prescribed in the learning method.

The weighting values of the weighting vector Θ are adapted to the training data on the basis of each event ω_{t_k} in accordance with the following adaptation rule:

\Theta_k = \Theta_{k-1} + \gamma_k \, N^{\kappa}(d_k) \, z_t,    (28)

in which case

d_k = e^{-\beta (t_k - t_{k-1})} \left( g(x_{t_k}, \omega_k, a_{t_k}) + \tilde{J}(x_{t_k}, \Theta_{k-1}) \right) - \tilde{J}(x_{t_{k-1}}, \Theta_{k-1}),    (29)

(30)

and

N^{\kappa}(\xi) = (1 - \kappa \, \mathrm{sign}(\xi)) \, \xi.    (31)

It is assumed that: z_{-1} = 0.
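The effect of the risk parameter κ on a single adaptation step can be sketched in Python as follows. The sketch assumes that the adaptation rule has the form "new weights = old weights + step size × N^κ(d) × z", with the vector z taken as given, because rule (30) is not reproduced here. The names `risk_transform` and `update_weights` are illustrative.

```python
from typing import List, Sequence


def risk_transform(xi: float, kappa: float) -> float:
    """Risk control function (1 - kappa * sign(xi)) * xi for kappa in [-1, 1]."""
    if not -1.0 <= kappa <= 1.0:
        raise ValueError("kappa must lie in the interval [-1, 1]")
    sign = (xi > 0) - (xi < 0)
    return (1.0 - kappa * sign) * xi


def update_weights(
    theta: Sequence[float],
    z: Sequence[float],
    d_k: float,
    gamma_k: float,
    kappa: float,
) -> List[float]:
    """One adaptation step; z is assumed to be supplied by the caller."""
    if len(theta) != len(z):
        raise ValueError("theta and z must have the same length")
    scaled = gamma_k * risk_transform(d_k, kappa)
    return [th + scaled * zi for th, zi in zip(theta, z)]
```

For κ > 0, negative temporal differences are amplified and positive ones are damped, which corresponds to risk-avoiding learning. For κ = 1, positive differences are ignored entirely, which corresponds to the worst-case setting.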

The function

g(x_{t_k}, \omega_k, a_{t_k})    (32)

denotes the immediate gain in accordance with the following rule:


Thus, as described above, a sequence of actions is determined with regard to a connection request such that a connection request is either rejected or accepted on the basis of an action. The determination is performed taking account of an optimization function in which the risk can be set in a variable fashion by means of a risk control parameter κ ∈ [-1; 1].

Figure 6 shows an embodiment of the present invention in relation to a traffic management system: a road 600 on which automobiles 601, 602, 603, 604, 605 and 606 are being driven. Conductor loops 610, 611 integrated into the road 600 receive electric signals in a known way and feed the electric signals 615, 616 to a computer 620 via an input/output interface 621. In an analog-to-digital converter 622 connected to the input/output interface 621, the electric signals are digitized into a time series and stored in a memory 623, which is connected by a bus 624 to the analog-to-digital converter 622 and a processor 625. Via the input/output interface 621, a traffic management system 650 is fed control signals 651 from which it is possible to set a prescribed speed stipulation 652 in the traffic management system 650, or else further particulars of traffic regulations, which are displayed via the traffic management system 650 to drivers of the vehicles 601, 602, 603, 604, 605 and 606.

The following local state variables are used in this case for the purpose of traffic modeling: 

• traffic flow rate v,

• vehicle density ρ (ρ = number of vehicles per kilometer, Fz/km),

• traffic flow q (q = number of vehicles per hour, Fz/h; q = v * ρ), and

• speed restrictions 652 displayed by the traffic management system 650 at an instant in each case.

The local state variables are measured as described above by using the conductor loops 610, 611.

These variables (v(t), ρ(t), q(t)) therefore represent a state of the technical system of "traffic" at a specific instant t.

In this embodiment, the system is therefore a traffic system which is controlled by using the traffic management system 650, and an extended Q-learning method is described as a method of approximative dynamic programming.

The state x_t is described by a state vector

x(t) = (v(t), \rho(t), q(t)).    (34)



The action a_t denotes the speed restriction 652, which is displayed at the instant t by the traffic management system 650. The gain r(x_t, a_t, x_{t+1}) describes the quality of the traffic flow which was measured between the instants t and t+1 by the conductor loops 610 and 611.

In this embodiment, r(x_t, a_t, x_{t+1}) denotes

• the average speed of the vehicles in the time interval [t, t+1], or

• the number of vehicles which have passed the conductor loops 610 and 611 in the time interval [t, t+1], or

• the variance of the vehicle speeds in the time interval [t, t+1], or

• a weighted sum from the above variables.

A value of the optimization function OFQ is determined for each possible action a_t, that is to say for each speed restriction which can be displayed by the traffic management system 650, an estimated value of the optimization function OFQ being realized in each case as a neural network.

This results in a set of evaluation variables for the various actions a_t in the system state x_t. Those actions a_t for which the maximum evaluation variable OFQ has been determined in the current system state x_t are selected in a control phase from the possible actions a_t, that is to say from the set of the speed restrictions which can be displayed by the traffic management system 650.

In accordance with this embodiment, the adaptation rule, known from the Q-learning method, for calculating the optimization function OFQ is extended by a risk control function N^κ(·), which takes account of the risk.

In turn, the risk control parameter κ is prescribed in accordance with the strategy from the first exemplary embodiment in the interval -1 ≤ κ ≤ 1, and represents the risk which a user wishes to run in the application with regard to the control strategy to be determined.

The following evaluation function OFQ is used in accordance with this exemplary embodiment:

OFQ = Q(x; w^a)    (35)

• x = (v; ρ; q) denoting a state of the traffic system,

• a denoting a speed restriction from the action space A of all speed restrictions which can be displayed by the traffic management system 650, and

• w^a denoting the weights of the neural network which belong to the speed restriction a.

The following adaptation step is executed in Q-learning in order to determine the optimum weights w^a of the neural network:

w_{t+1}^{a_t} = w_t^{a_t} + \eta_t \, N^{\kappa}(d_t) \, \nabla Q(x_t; w_t^{a_t})    (36)

using the abbreviation

d_t = r(x_t, a_t, x_{t+1}) + \gamma \max_{a \in A} Q(x_{t+1}; w_t^{a}) - Q(x_t; w_t^{a_t})    (37)

• x_t, x_{t+1} denoting in each case a state of the traffic system in accordance with rule (34),

• a_t denoting an action, that is to say a speed restriction which can be displayed by the traffic management system 650,

• γ denoting a prescribable reduction factor,

• w_t^{a_t} denoting the weighting vector belonging to the action a_t, before the adaptation step,

• w_{t+1}^{a_t} denoting the weighting vector belonging to the action a_t, after the adaptation step,

• η_t (t = 1, ...) denoting a prescribable step size sequence,

• κ ∈ [-1; 1] denoting a risk control parameter,

• N^κ denoting a risk control function N^κ(ξ) = (1 - κ sign(ξ)) ξ,

• ∇Q(·;·) denoting the derivative of the neural network with respect to its weights, and

• r(x_t, a_t, x_{t+1}) denoting a gain upon the transition in state from the state x_t to the subsequent state x_{t+1}.
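A single adaptation step of this kind can be sketched in Python as follows. The sketch rests on several assumptions that are not taken from the text: the adaptation step is assumed to have the usual Q-learning form with the temporal difference passed through the risk control function, and the neural network is replaced, purely for brevity, by a linear model per action, so that the derivative with respect to the weights is simply the state vector. The names `q_value` and `risk_sensitive_q_step` are illustrative.

```python
from typing import Dict, List, Sequence


def q_value(weights: Sequence[float], state: Sequence[float]) -> float:
    """Hypothetical linear stand-in for the neural network of one action."""
    return sum(w * x for w, x in zip(weights, state))


def risk_sensitive_q_step(
    weights_by_action: Dict[int, List[float]],
    x_t: Sequence[float],
    a_t: int,
    gain: float,
    x_next: Sequence[float],
    discount: float,
    step_size: float,
    kappa: float,
) -> Dict[int, List[float]]:
    """Return updated weights; only the weights of action a_t change."""
    if not -1.0 <= kappa <= 1.0:
        raise ValueError("kappa must lie in the interval [-1, 1]")
    # Temporal difference using the best evaluation in the subsequent state.
    best_next = max(q_value(w, x_next) for w in weights_by_action.values())
    d_t = gain + discount * best_next - q_value(weights_by_action[a_t], x_t)
    # Risk control function (1 - kappa * sign(d)) * d.
    sign = (d_t > 0) - (d_t < 0)
    scaled = step_size * (1.0 - kappa * sign) * d_t
    # For the linear model, the gradient with respect to the weights is x_t.
    updated = {a: list(w) for a, w in weights_by_action.items()}
    updated[a_t] = [w + scaled * x for w, x in zip(weights_by_action[a_t], x_t)]
    return updated
```

The dictionary keys stand for the displayable speed restrictions. With κ > 0, an unexpectedly poor traffic outcome changes the weights more strongly than an unexpectedly good one of the same magnitude.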

An action a_t can be selected at random from the possible actions a_t during learning. It is not necessary in this case to select the action a_t which has led to the largest evaluation variable.

The adaptation of the weights has to be performed in such a way that not only is a traffic control achieved which is optimized in terms of the expectation of the optimization function, but that also account is taken of a variance of the control results.

This is particularly advantageous since the state vector x(t) models the actual system of traffic 
only inadequately in some aspects, and so unexpected disturbances can thereby occur. Thus, the 
dynamics of the traffic, and therefore of its modeling, depend on further factors such as weather, 
proportion of trucks on the road, proportion of mobile homes, etc., which are not always integrated in 
the measured variables of the state vector x(t). In addition, it is not always ensured that the road 
users immediately implement the new speed instructions in accordance with the traffic management 
system.

A control phase on the real system in accordance with the traffic management system takes place in accordance with the following steps:

1. The state x_t is measured at the instant t at various points in the traffic system and yields a state vector x(t) := (v(t), ρ(t), q(t)).

2. A value of the optimization function is determined for all possible actions a_t, and that action a_t with the highest evaluation in the optimization function is selected.
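The two steps of the control phase can be sketched in Python as follows. The sketch assumes, hypothetically, that the trained evaluation for each displayable speed restriction is available as a callable. The names `build_state` and `select_speed_restriction` are illustrative, and the traffic flow is derived from the relation q = v * ρ given above.

```python
from typing import Callable, Dict, Tuple

TrafficState = Tuple[float, float, float]  # (v, rho, q)


def build_state(speed: float, density: float) -> TrafficState:
    """Step 1: form the state vector from measured speed and density (q = v * rho)."""
    return (speed, density, speed * density)


def select_speed_restriction(
    state: TrafficState,
    evaluators: Dict[int, Callable[[TrafficState], float]],
) -> int:
    """Step 2: evaluate every displayable restriction and pick the highest evaluation."""
    if not evaluators:
        raise ValueError("at least one speed restriction is required")
    return max(evaluators, key=lambda action: evaluators[action](state))
```

The keys of `evaluators` stand for the speed restrictions which can be displayed. In a real system, each callable would be the trained network belonging to that restriction.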

Although modifications and changes may be suggested by those skilled in the art to which this invention pertains, it is the intention of the inventors to embody within the patent warranted hereon all changes and modifications that may reasonably and properly come under the scope of their contribution to the art. - -]
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